System and method to extract information from unstructured image documents

ABSTRACT

The present disclosure relates to a system and method to extract information from unstructured image documents. The extraction technique is content-driven and not dependent on the layout of a particular image document type. The disclosed method breaks down an image document into smaller images using the text cluster detection algorithm. The smaller images are converted into text samples using optical character recognition (OCR). Each of the text samples is fed to a trained machine learning model. The model classifies each text sample into one of a plurality of pre-determined field types. The desired value extraction problem may be converted into a question-answering problem using a pre-trained model. A fixed question is formed on the basis of the classified field type. The output of the question-answering model may be passed through a rule-based post-processing step to obtain the final answer.

RELATED APPLICATION

This application is a continuation of and claims the benefit of U.S. patent application Ser. No. 17/405,964, filed Aug. 18, 2021, entitled “System and Method to Extract Information from Unstructured Image Documents,” which claims the benefit of U.S. Provisional Patent Application No. 63/067,714, filed Aug. 19, 2020, entitled “System and Method to Extract Information from Unstructured Image Documents,” the entirety of which is incorporated herein by reference.

TECHNICAL FIELD

The present disclosure generally relates to automatically extracting relevant information from unstructured image documents irrespective of whether the layout of the image document is known.

BACKGROUND

Automatic information extraction from unstructured images is important for various applications, such as workflow automation that needs to take action based on certain values in incoming messages, automatic form-filling applications that need to extract field values associated with certain entities found in the form, and applications that convert values found in unstructured images to structured data with a defined schema, as in databases.

Traditional extraction methods involve passing an image document directly to an optical character recognition (OCR) model. Without any understanding of the layout of the document, these methods suffer in recognizing independent chunks of information. The present disclosure tackles that problem by using an algorithm that detects text clusters.

SUMMARY

The following is a simplified summary of the disclosure in order to provide a basic understanding of some aspects of the disclosure. This summary is not an extensive overview of the disclosure. It is intended to neither identify key or critical elements of the disclosure, nor delineate any scope of the particular implementations of the disclosure or any scope of the claims. Its sole purpose is to present some concepts of the disclosure in a simplified form as a prelude to the more detailed description that is presented later.

The present disclosure involves using a combination of image processing and natural language processing based on machine learning to extract known fields from a given unstructured image document whose layout and format are unknown. The type of the document may be known a priori, i.e., whether it is an invoice, a prescription, or another type of document.

One aspect of the disclosure is converting an image document into multiple smaller images based on a text cluster detection algorithm.

Another aspect of the disclosure is to use a text classification model to classify the text clusters obtained after OCR of the smaller images into one of the pre-determined fields based on the document type.

Yet another aspect of the disclosure is to convert the text extraction problem into a question-answering problem, where fixed questions are formed on the basis of the fields determined in the previous step and the final answer is obtained by passing the output of the question-answering model to a field-specific rule-based filter.

Specifically, a computer-implemented method (and a system implementing the method) is disclosed for recognizing relevant information from an unstructured image. The method comprises: receiving an unstructured image document as input; dividing the unstructured image document into a plurality of smaller images using an image processing technique; performing an optical character recognition (OCR) operation on the plurality of the smaller images to generate a corresponding plurality of text outputs; classifying the plurality of text outputs using a trained machine learning model configured to classify text; and using a combination of a pre-trained question-answering model and rule-based filters to obtain a final answer from the classified plurality of text outputs.

The image processing technique may be a text clustering technique that may apply morphological transformations, such as dilation on one or both axes, to generate bounding boxes around text clusters. Neighboring bounding boxes may be merged based on whether the originally generated bounding boxes could extract the desired key-value pairs, i.e., field types paired with their values. The merging decision may be based on whether the centroid height difference between the neighboring bounding boxes is below a certain threshold.

The trained machine learning model, to which the plurality of text outputs obtained after performing the OCR operation individually on the plurality of smaller images is fed, may comprise a deep neural network model that learns from word sequences to classify text into one or more predetermined field types.

In a specific aspect, a system for recognizing a relevant value from an unstructured document is disclosed, where a computer processor performs the operations of: receiving an unstructured document as input; detecting a plurality of text clusters in the unstructured document; generating, by an optical character recognition (OCR) module, a plurality of text outputs from the plurality of text clusters, wherein each text cluster corresponds to a respective text output; classifying the plurality of text outputs using a natural language processing algorithm configured to classify text; using a pre-trained question-answering model to obtain an initial answer from one or more of the classified plurality of text outputs; and extracting a final answer, based on the initial answer, to be presented as an extracted value to be associated with a corresponding field.

BRIEF DESCRIPTION OF THE DRAWINGS

The present disclosure will be understood more fully from the detailed description given below and from the accompanying drawings of various embodiments of the disclosure.

FIG. 1 is a flow diagram of the process to extract information from unstructured images by understanding the layout of the image and the text clusters, according to an embodiment of the present disclosure.

FIG. 2 represents a sample input image, according to an embodiment of the present disclosure.

FIGS. 3-5 depict a few of the intermediate representations of the image in the text cluster detection algorithm, according to an embodiment of the present disclosure.

Specifically, FIG. 3 depicts an intermediate snapshot of the image processing layer in the text cluster detection algorithm. FIG. 4 depicts another intermediate snapshot of the image processing layer in the text cluster detection algorithm at a later point of time compared to the snapshot depicted in FIG. 3. FIG. 5 depicts one of the later snapshots of the image processing layer in the text cluster detection algorithm, i.e., at an even later point of time compared to the snapshot depicted in FIG. 4. Each iteration dilates the image further.

FIG. 6 is the output of the text cluster detection algorithm, according to an embodiment of the present disclosure. Rectangular contours are drawn for each of the text clusters found by the algorithm.

FIG. 7 depicts a working example of the text classifier model where it classifies each of the rectangular contours into one or more of the desired categories (key types), according to an embodiment of the present disclosure.

FIG. 8 depicts a working example of the question-answering algorithm where it frames a question to extract the final answer, according to an embodiment of the present disclosure. A question is formed based on the extracted entity from the text classification component. The question-answering model outputs the value associated with the key type.

FIG. 9 illustrates the concept of a compound box, according to an embodiment of the present disclosure. If the required entity types are not found from the boxes obtained from the text cluster algorithm, compound boxes are formed by merging adjacent boxes. The rest of the algorithmic pipeline then proceeds using the merged compound boxes.

FIG. 10 illustrates an example machine of a computer system within which a set of instructions, for causing the machine to perform any one or more of the methodologies discussed herein, can be executed.

DETAILED DESCRIPTION

Overview

Embodiments of the present disclosure are directed to automatically extracting relevant information from unstructured image documents even when the layout of the image document is unknown. An algorithm disclosed herein, referred to as the “text cluster detection algorithm,” extracts information after automatically understanding the document layout.

The extraction technique is content-driven and not dependent on the layout or format of a particular document type. The disclosed method breaks down an image document into smaller images using the text cluster detection algorithm, which can work on an unstructured image document. The smaller images are converted into text samples using optical character recognition (OCR). Each of the text samples is fed to a trained machine learning model. The model classifies each text sample into one of a plurality of pre-determined field types. The desired value extraction problem may be converted into a question-answering problem using a pre-trained model. A fixed question is formed on the basis of the classified field type. The output of the question-answering model may be passed through a rule-based post-processing step to obtain the final answer.

FIG. 1 is a flow diagram of an example high-level method 100 of automatic information extraction as implemented by a component operating in accordance with some embodiments of the present disclosure. The method 100 can be performed by processing logic that can include hardware (e.g., processing device, circuitry, dedicated logic, programmable logic, microcode, hardware of a device, integrated circuit, etc.), software (e.g., instructions run or executed on a processing device), or a combination thereof. In some embodiments, the method 100 is performed by the information extraction component 1013 shown in FIG. 10. Although shown in a particular sequence or order, unless otherwise specified, the order of the operations can be modified. Thus, the illustrated embodiments should be understood only as examples, and the illustrated operations can be performed in a different order, while some operations can be performed in parallel. Additionally, one or more operations can be omitted in some embodiments. Thus, not all illustrated operations are required in every embodiment, and other process flows are possible. The information extraction component can have sub-components, as described below. Each of the sub-components may involve text-based or vision-based processing.

At operation 110, an input image, such as what is shown in FIG. 2, is provided. This input image can be an invoice, in a non-limiting illustrative example. However, other types of input images, for example an insurance bill, a health record, an educational institute record, a juristic record, or any other record/document, can be used as the input image.

At operation 120, a processor in a computing machine executes the disclosed text cluster algorithm, which is described in greater detail below.

At operation 130, the output of the text cluster algorithm is provided to an optical character recognition (OCR) program.

At operation 140, the output from the OCR program, i.e., the results of running the OCR on the clustered texts, is fed to a classification algorithm so that the results can be categorized into predetermined field (or entity) types. Note that in some embodiments neural network models may be used in operation 140 to classify text. For example, a deep neural network model, such as a Bidirectional Encoder Representations from Transformers (BERT) model, may be used.

At operation 150, the classified texts are fed to a question-answer algorithm in an attempt to find a final answer.

At operation 160, optionally, some rule-based filters are applied to the output of the question-answer algorithm.

At operation 170, the final answer is output, which is the extracted value automatically generated by the algorithms described in the above operations.

Text Cluster Detection Algorithm

FIGS. 2-6 depict the end-to-end working of the text cluster detection algorithm. In this algorithm, an input image is run through multiple steps of morphological transformations in an iterative fashion. The morphological transformations dilate the image until a minimum number of independent dilated clouds (similar to blobs) are formed. The clouds are then used to form boxes in the image. Each box represents a separate text cluster. After retaining the original boxes (known as the primary boxes), some of the boxes may optionally be merged together based on the difference in the center heights (i.e., location of the centroid) of the neighboring boxes (known as the secondary boxes). This is described below with respect to FIG. 9.
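For illustration only, the following is a minimal sketch of the iterative dilation and box-forming step described above, assuming the OpenCV library and a grayscale page image; the function name, kernel size, and stopping criterion are illustrative assumptions, not the claimed implementation.

```python
import cv2

def detect_text_clusters(gray, max_iters=30, min_clusters=5):
    # Binarize so that text pixels become white (255) on a black background.
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3))
    dilated = binary
    for _ in range(max_iters):
        # Each iteration dilates the image further, growing "clouds" of text.
        dilated = cv2.dilate(dilated, kernel, iterations=1)
        contours, _ = cv2.findContours(dilated, cv2.RETR_EXTERNAL,
                                       cv2.CHAIN_APPROX_SIMPLE)
        # Stop once the independent clouds have merged down to a small number.
        if len(contours) <= min_clusters:
            break
    # Each remaining contour yields one primary bounding box (x, y, w, h).
    return [cv2.boundingRect(c) for c in contours]
```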

For certain embodiments, e.g., for certain detected types of documents, the above steps may be repeated with slight variations. For example, the dilations can be biased along one of the two axes. In an exemplary embodiment, the y-axis dilations may be slowed down by a user-selected or automatically determined factor (e.g., a factor of 3) compared to the x-axis dilations, or vice versa. This scaling step may be repeated for both the axes sequentially or in parallel. This repetition results in four (4) more sets of boxes, namely, x-primary, x-secondary, y-primary and y-secondary.
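A similarly hedged sketch of the axis-biased pass follows; the kernel shapes simply make growth along one axis roughly three times slower than along the other, mirroring the example factor of 3 above, and the helper is an assumed variant of the function sketched earlier rather than a prescribed implementation.

```python
import cv2

def detect_biased_clusters(binary, slow_axis="y", factor=3,
                           max_iters=30, min_clusters=5):
    # A wide, flat kernel grows clusters mostly along x (slowing y growth);
    # a tall, narrow kernel does the opposite.
    ksize = (factor, 1) if slow_axis == "y" else (1, factor)
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, ksize)
    dilated = binary
    for _ in range(max_iters):
        dilated = cv2.dilate(dilated, kernel, iterations=1)
        contours, _ = cv2.findContours(dilated, cv2.RETR_EXTERNAL,
                                       cv2.CHAIN_APPROX_SIMPLE)
        if len(contours) <= min_clusters:
            break
    return [cv2.boundingRect(c) for c in contours]

# Running the loop once per axis yields the x-primary and y-primary box sets;
# the corresponding secondary sets come from the optional merging step.
```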

One or more sets of boxes from the possible six (6) sets of boxes described above (i.e., primary boxes, secondary boxes, x-primary boxes, x-secondary boxes, y-primary boxes and y-secondary boxes) are then individually (or in combination) run through an optical character recognition (OCR) program to obtain an array of text samples for further processing.
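For concreteness, one way to run OCR over each selected box is sketched below, assuming the pytesseract wrapper around the Tesseract OCR engine, a page image held as a NumPy array, and boxes given as (x, y, w, h) tuples; these are illustrative choices rather than requirements of the disclosure.

```python
import pytesseract

def ocr_boxes(page_image, boxes):
    # Crop each bounding box into a smaller image and OCR it independently.
    text_samples = []
    for (x, y, w, h) in boxes:
        crop = page_image[y:y + h, x:x + w]
        text_samples.append(pytesseract.image_to_string(crop).strip())
    return text_samples
```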

Natural Language Processing

The array of text samples obtained from the OCR program is then fed into a text classification model. This model can be a machine-learning model that has been trained on similar text samples to predict one of the predetermined field types for a particular document (or document type). The model can also support multi-label classification and can classify a text sample into more than one of the known field types. This part of the method helps in improving the accuracy of the overall system. This model can be based on a deep neural network.
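As a non-limiting sketch, such a classifier could be served through the Hugging Face transformers library as shown below; the model path and the field-type labels are placeholder assumptions, and a BERT-style model fine-tuned on labeled text samples is only one possible choice.

```python
from transformers import pipeline

# Placeholder path to a transformer fine-tuned to predict field types
# such as "invoice_number", "invoice_date", or "ship_to_address".
classifier = pipeline("text-classification",
                      model="path/to/fine-tuned-field-classifier")

def classify_samples(text_samples):
    field_types = {}
    for text in text_samples:
        prediction = classifier(text)[0]   # {"label": ..., "score": ...}
        field_types[text] = prediction["label"]
    return field_types
```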

Further Details of the Information Extraction Process Flow

FIG. 2 represents a sample input image, according to an embodiment of the present disclosure. Specifically, the illustrative example of an input image shown here is an invoice showing the standard fields, such as the “bill to” address, the “ship to” address, the invoice number, the invoice date, and details of the shipment and pricing. This document does not have to have a familiar layout, i.e., it can be a previously unseen document. It can have any layout or any format. This input image is fed to the algorithmic flow 100 described in FIG. 1.

The text cluster detection algorithm employs image processing techniques that generate intermediate snapshots as the algorithm progresses. FIG. 3 depicts an intermediate snapshot 300 of the image processing layer in the text cluster detection algorithm. FIG. 4 depicts another intermediate snapshot 400 of the image processing layer in the text cluster detection algorithm at a later point of time compared to the snapshot depicted in FIG. 3. FIG. 5 depicts one of the later snapshots 500 of the image processing layer in the text cluster detection algorithm, i.e., at an even later point of time compared to the snapshot depicted in FIG. 4. Each iteration dilates the image further. Dilation is a morphological operation that adds pixels to the boundaries of objects in an image. The number of pixels added to (or removed from, if it is an erosion process) the objects in an image depends on the size and shape of the structuring element used to process the image.

FIG. 6 is the output of the text cluster detection algorithm, according to an embodiment of the present disclosure. Rectangular contours (“bounding boxes” or simply “boxes”) are drawn for each of the text clusters found by the algorithm. Boxes 605, 610, 615, 620, 625, 630, 635, 640, 645, 650, 655, 660 and 665 denote text clusters. Additional bounding boxes, such as box 680 on top of the LOGO (and a similar box below the LOGO), may be formed within another box (e.g., box 610).

The OCR outputs on the clustered texts can be fed to a trained machine learning model. The machine learning model can comprise a deep neural network model that learns from word sequences to classify text into one or more predetermined field types. The deep neural network model may comprise a text classification model. The deep neural network may also be based on a Bidirectional Encoder Representations from Transformers (BERT) model.

FIG. 7 depicts a working example of the text classification model where it classifies each of the rectangular contours into one or more of the desired categories (key types), according to an embodiment of the present disclosure. For example, the desired categories can be invoice number, invoice date, and invoice due date (i.e., when payment is due), when it is known that the input image is an invoice. The categories are also referred to as “fields” or “field types”.

FIG. 8 depicts a working example of the question-answering algorithm where it frames a question to extract the final answer, according to an embodiment of the present disclosure. A question is formed based on the extracted entity from the text classification component. The question-answering model outputs the value associated with the key type. In this particular example, the sample text contained a “Ship to” address. The model asked the question “what is the shipping address?” and the extracted answer is “John Smith 3787 Pineview Drive, Cambridge, MA 12210.” This answer can be further refined by using rule-based filters to separate the name “John Smith” from the street address “3787 Pineview Drive.”
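The question-answering step can be sketched, purely for illustration, with a publicly available extractive QA model; the model name, question templates, and field labels below are assumptions and not part of the claimed system.

```python
from transformers import pipeline

qa_model = pipeline("question-answering",
                    model="deepset/roberta-base-squad2")  # example public model

QUESTION_TEMPLATES = {
    "ship_to_address": "What is the shipping address?",
    "invoice_number": "What is the invoice number?",
    "invoice_date": "What is the invoice date?",
}

def extract_value(field_type, cluster_text):
    # A fixed question is formed from the classified field type, and the
    # cluster text serves as the context for extractive question answering.
    question = QUESTION_TEMPLATES[field_type]
    answer = qa_model(question=question, context=cluster_text)
    return answer["answer"]
```

A rule-based post-processing filter, for example a regular expression that splits a person's name from the street number that follows it, may then refine the raw answer into the final extracted value.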

FIG. 9 illustrates the concept of a compound box, according to an embodiment of the present disclosure. As shown in FIG. 6, each box (605, 610, . . . 675, 680) represents a separate text cluster. After retaining the original boxes (known as the primary boxes), some of the boxes may optionally be merged together based on the difference in the center heights (i.e., location of the centroid) of the neighboring boxes (known as the secondary boxes). If the required entity types are not found from the boxes obtained from the text cluster algorithm, compound boxes are formed by merging adjacent neighboring boxes.

For example, in FIG. 9, for the primary box 630, a secondary box is 635. Whether the boxes 630 and 635 are to be merged depends on whether the difference in absolute heights of the centroids of the boxes 630 and 635 is within a user-selected or automatically determined threshold. The centroid of each box is at the intersection of the two diagonals. The centroid height threshold 910 may be a certain percentage of the height of the boxes in consideration. For example, it may be 1% of the height of the boxes in consideration. The centroid height difference is only considered in the vertical direction; no horizontal distance is typically considered in the merge decision. In a similar fashion, whether boxes 650 and 655 will be merged, whether boxes 660 and 665 should be merged, or whether boxes 640 and 645 should be merged depends on the centroid height difference threshold.
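A minimal sketch of this merge test follows, assuming boxes expressed as (x, y, w, h) tuples with y increasing downward; the 1% ratio mirrors the example threshold given above and is not a fixed requirement.

```python
def should_merge(box_a, box_b, threshold_ratio=0.01):
    xa, ya, wa, ha = box_a
    xb, yb, wb, hb = box_b
    # Compare only the vertical positions of the two centroids.
    centroid_diff = abs((ya + ha / 2.0) - (yb + hb / 2.0))
    threshold = threshold_ratio * max(ha, hb)
    return centroid_diff <= threshold

def merge_boxes(box_a, box_b):
    # The compound box is the union rectangle of the two neighbors.
    xa, ya, wa, ha = box_a
    xb, yb, wb, hb = box_b
    x1, y1 = min(xa, xb), min(ya, yb)
    x2, y2 = max(xa + wa, xb + wb), max(ya + ha, yb + hb)
    return (x1, y1, x2 - x1, y2 - y1)
```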

Once the merging decision is made, the rest of the algorithmic pipeline proceeds as described above, the only difference being that the algorithm now uses the merged compound boxes. The compound boxes are used to search for key-value pairs that were not extracted from the original set of boxes. In other words, the decision of merging may be invoked if the required entity types are not found from the original boxes obtained from the text cluster detection algorithm.

FIG. 10 illustrates an example machine of a computer system 1000 within which a set of instructions, for causing the machine to perform any one or more of the methodologies discussed herein, can be executed. In some embodiments, the computer system 1000 can correspond to a host system that includes, is coupled to, or utilizes a memory sub-system or can be used to perform the operations of a processor (e.g., to execute an operating system to perform operations corresponding to automatic information extraction, also referred to as information extraction component 1013). Note that the information extraction component 1013 may have sub-components, for example, a text-cluster detection sub-component (which can also have a neighboring-boxes merging decision-making component), an OCR sub-component, a text classification sub-component, a question-answering model component, a rule-based filter component and an output presentation component. In alternative embodiments, the machine can be connected (e.g., networked) to other machines in a LAN, an intranet, an extranet, and/or the Internet. The machine can operate in the capacity of a server or a client machine in client-server network environment, as a peer machine in a peer-to-peer (or distributed) network environment, or as a server or a client machine in a cloud computing infrastructure or environment.

The machine can be a personal computer (PC), a tablet PC, a set-top box (STB), a Personal Digital Assistant (PDA), a cellular telephone, a web appliance, a server, a network router, a switch or bridge, or any machine capable of executing a set of instructions (sequential or otherwise) that specify actions to be taken by that machine. Further, while a single machine is illustrated, the term “machine” shall also be taken to include any collection of machines that individually or jointly execute a set (or multiple sets) of instructions to perform any one or more of the methodologies discussed herein.

The example computer system 1000 includes a processing device 1002, a main memory 1004 (e.g., read-only memory (ROM), flash memory, dynamic random access memory (DRAM) such as synchronous DRAM (SDRAM) or Rambus DRAM (RDRAM), etc.), a static memory 1008 (e.g., flash memory, static random access memory (SRAM), etc.), and a data storage system 1018, which communicate with each other via a bus 1030.

Processing device 1002 represents one or more general-purpose processing devices such as a microprocessor, a central processing unit, or the like. More particularly, the processing device can be a complex instruction set computing (CISC) microprocessor, reduced instruction set computing (RISC) microprocessor, very long instruction word (VLIW) microprocessor, or a processor implementing other instruction sets, or processors implementing a combination of instruction sets. Processing device 1002 can also be one or more special-purpose processing devices such as an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), a digital signal processor (DSP), network processor, or the like. The processing device 1002 is configured to execute instructions 1028 for performing the operations and steps discussed herein. The computer system 1000 can further include a network interface device 1008 to communicate over the network 1020.

The data storage system 1018 can include a machine-readable storage medium 1024 (also known as a computer-readable medium) on which is stored one or more sets of instructions 1028 or software embodying any one or more of the methodologies or functions described herein. The instructions 1028 can also reside, completely or at least partially, within the main memory 1004 and/or within the processing device 1002 during execution thereof by the computer system 1000, the main memory 1004 and the processing device 1002 also constituting machine-readable storage media. The machine-readable storage medium 1024, data storage system 1018, and/or main memory 1004 can correspond to a memory sub-system.

In one embodiment, the instructions 1028 include instructions to implement functionality corresponding to the information extraction component 1013. While the machine-readable storage medium 1024 is shown in an example embodiment to be a single medium, the term “machine-readable storage medium” should be taken to include a single medium or multiple media that store the one or more sets of instructions. The term “machine-readable storage medium” shall also be taken to include any medium that is capable of storing or encoding a set of instructions for execution by the machine and that cause the machine to perform any one or more of the methodologies of the present disclosure. The term “machine-readable storage medium” shall accordingly be taken to include, but not be limited to, solid-state memories, optical media, and magnetic media.

Some portions of the preceding detailed descriptions have been presented in terms of algorithms and symbolic representations of operations on data bits within a computer memory. These algorithmic descriptions and representations are the ways used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. An algorithm is here, and generally, conceived to be a self-consistent sequence of operations leading to a desired result. The operations are those requiring physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, combined, compared, and otherwise manipulated. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers, or the like.

It should be borne in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. The present disclosure can refer to the action and processes of a computer system, or similar electronic computing device, that manipulates and transforms data represented as physical (electronic) quantities within the computer system's registers and memories into other data similarly represented as physical quantities within the computer system memories or registers or other such information storage systems.

The present disclosure also relates to an apparatus for performing the operations herein. This apparatus can be specially constructed for the intended purposes, or it can include a general purpose computer selectively activated or reconfigured by a computer program stored in the computer. Such a computer program can be stored in a computer readable storage medium, such as, but not limited to, any type of disk including floppy disks, optical disks, CD-ROMs, and magnetic-optical disks, read-only memories (ROMs), random access memories (RAMs), EPROMs, EEPROMs, magnetic or optical cards, or any type of media suitable for storing electronic instructions, each coupled to a computer system bus.

The algorithms and displays presented herein are not inherently related to any particular computer or other apparatus. Various general purpose systems can be used with programs in accordance with the teachings herein, or it can prove convenient to construct a more specialized apparatus to perform the method. The structure for a variety of these systems will appear as set forth in the description below. In addition, the present disclosure is not described with reference to any particular programming language. It will be appreciated that a variety of programming languages can be used to implement the teachings of the disclosure as described herein.

The present disclosure can be provided as a computer program product, or software, that can include a machine-readable medium having stored thereon instructions, which can be used to program a computer system (or other electronic devices) to perform a process according to the present disclosure. A machine-readable medium includes any mechanism for storing information in a form readable by a machine (e.g., a computer). In some embodiments, a machine-readable (e.g., computer-readable) medium includes a machine (e.g., a computer) readable storage medium such as a read only memory (“ROM”), random access memory (“RAM”), magnetic disk storage media, optical storage media, flash memory devices, etc.

In the specification, embodiments of the disclosure have been described with reference to specific example embodiments thereof. It will be evident that various modifications can be made thereto without departing from the broader spirit and scope of embodiments of the disclosure as set forth in the following claims. The specification and drawings are, accordingly, to be regarded in an illustrative sense rather than a restrictive sense.

What is claimed is:
1. A system for recognizing a relevant value from an unstructured document, where a computer processor performs the operations of: receiving an unstructured document as input; detecting a plurality of text clusters in the unstructured document; generating, by an optical character recognition (OCR) module, a plurality of text outputs from the plurality of text clusters, wherein each text cluster corresponds to a respective text output; classifying the plurality of text outputs using a natural language processing algorithm configured to classify text; using a pre-trained question-answering model to obtain an initial answer from one or more of the classified plurality of text outputs; and extracting a final answer, based on the initial answer, to be presented as an extracted value to be associated with a corresponding field.
2. The system of claim 1, wherein detecting the text clusters is agnostic of a layout or format of the unstructured document.
3. The system of claim 1, wherein the unstructured document is an unstructured image document in its entirety, or the unstructured document has a portion that is an unstructured image document.
4. The system of claim 3, wherein each of the text clusters is a smaller image within the unstructured image document.
5. The system of claim 1, wherein each of the text clusters is bounded within a contour of a respective bounding box.
6. The system of claim 5, where two neighboring bounding boxes are merged based on proximity of individual neighboring bounding box coordinates.
7. The system of claim 5, wherein the computer processor further performs the operation of: checking if a desired value is extracted from an initial set of text clusters with their respective bounding boxes; responsive to determining that the desired value is not extracted from the initial set of text clusters with their respective bounding boxes, creating compound bounding boxes by merging one or more neighboring bounding boxes, thereby merging the corresponding text clusters.
8. The system of claim 7, wherein the computer processor further performs the operation of: generating a new text output by performing an optical character recognition (OCR) operation on the merged text clusters; classifying the new text output using the natural language processing algorithm configured to classify text; using the pre-trained question-answering model to obtain a revised answer from the classified new text output; and extracting a revised final answer, based on the revised answer, to be presented as a new extracted value to be associated with the field.
9. The system of claim 1, wherein detecting the plurality of text clusters uses a text cluster detection algorithm executed by the computer processor.
10. The system of claim 9, wherein the text cluster detection algorithm applies a morphological transformation on the bounding boxes along one or more axes.
11. The system of claim 10, wherein the morphological transformation is an iterative process that, in each iteration, creates an intermediate set of bounding boxes from a previous set of bounding boxes.
12. The system of claim 11, wherein, in one or more iterations, the morphological transformation along a selected axis applies a dilation rate that is faster or slower than a dilation rate along another axis.
13. The system of claim 11, wherein relative dilation rates along different axes can be scaled up or down by predetermined factors.
14. The system of claim 11, wherein dilation scaling can be applied to axes of the bounding boxes sequentially.
15. The system of claim 11, wherein dilation scaling can be applied to both axes of the bounding boxes in parallel.
16. The system of claim 11, wherein the intermediate set of bounding boxes are fed to the OCR module to obtain an array of text samples for natural language processing.
17. The system of claim 1, where the pre-trained question-answering model is used to fetch one or more relevant answers corresponding to each of the plurality of text outputs.
18. The system of claim 17, where an output of the question-answering model is passed through one or more rule-based filters to obtain the final answer.